pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, extrafont, knitr, ggtext,
               DT) # DT provides datatable(), used for the interactive tables
Alicia
June 3, 2023
June 17, 2023
This exercise aims to use appropriate static and interactive statistical graphics methods to help FishEye identify companies that may be engaged in illegal fishing.
The dataset comes from Mini Challenge 3 of the VAST Challenge 2023.
There is one file downloaded: MC3.json.
This exercise aims to answer Q1 of the challenge:
The code chunk below will be used to install and load the necessary R packages to meet the data preparation, data wrangling, data analysis and visualisation needs.
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
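The extraction chunks are not reproduced in this section, so the following is only a minimal sketch of the two steps, assuming the JSON has top-level `links` and `nodes` arrays. The inline `toy_json` string and all `toy_*` names are hypothetical stand-ins for MC3.json and `mc3_data`, and deriving `weights` with `count()` is likewise an assumption.

```r
library(jsonlite)
library(dplyr)
library(tibble)

# Hypothetical miniature of MC3.json; the real file is far larger.
toy_json <- '{
  "nodes": [
    {"id": "Adams Group", "country": "Oceanus", "type": "Company"},
    {"id": "John Smith", "country": "Oceanus", "type": "Beneficial Owner"}
  ],
  "links": [
    {"source": "Adams Group", "target": "John Smith", "type": "Beneficial Owner"},
    {"source": "Adams Group", "target": "John Smith", "type": "Beneficial Owner"}
  ]
}'
toy_data <- fromJSON(toy_json)

# Edges: drop exact duplicates, force character columns, then attach a weight.
toy_edges <- as_tibble(toy_data$links) %>%
  distinct() %>%
  mutate(across(c(source, target, type), as.character)) %>%
  count(source, target, type, name = "weights")

# Nodes: keep one row per distinct record, with character columns.
toy_nodes <- as_tibble(toy_data$nodes) %>%
  distinct() %>%
  mutate(across(c(id, country, type), as.character))
```

Because duplicates are removed before counting, every edge in this sketch ends up with a weight of 1.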
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any field.
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
The plot below shows the distribution of the type variable in the mc3_edges data table. This variable takes only two values, Beneficial Owner and Company Contacts, and there are many more Beneficial Owner edges than Company Contacts edges.
# Collect every id that appears as a source or a target
id1 <- mc3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- mc3_edges %>%
  select(target) %>%
  rename(id = target)
# Attach node attributes; ids without a match keep NA attributes
mc3_nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(mc3_nodes,
            by = "id",
            unmatched = "drop")
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
                       edges = mc3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  # Constant alpha and colour go outside aes() so they are not mapped to legends
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph() +
  labs(title = "Network model of mc3 data")
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
The report above reveals that the character fields have no missing values, but revenue_omu is missing for 21,515 rows (a complete rate of 0.22).
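The missing-value counts reported by skim() can be cross-checked directly. The sketch below uses a hypothetical three-row `toy_nodes` tibble rather than the real mc3_nodes table.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for mc3_nodes.
toy_nodes <- tibble(
  id = c("Adams Group", "John Smith", "Guzman-Chang"),
  country = c("Oceanus", "Oceanus", "Puerto Sol"),
  revenue_omu = c(9056, NA, NA)
)

# One row, one column per variable, each holding that variable's NA count.
na_counts <- toy_nodes %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Applied to the real table, the same call should agree with the n_missing column of the skim() report.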
In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
The plot below shows the distribution of the type variable in the mc3_nodes data table. This variable takes three values: Beneficial Owner, Company and Company Contacts. Beneficial Owners are the most common, followed by Companies and then Company Contacts.
This section performs basic text sensing using appropriate functions of tidytext package.
This code counts the number of words related to fishing e.g. “fish”, “fishes”, “seafood”, “fishing”, etc. in the product_services column. Before that, we make the characters in product_services all lower case for ease of searching.
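The counting chunk is not reproduced here; a minimal sketch of the idea, assuming a simple regular expression is enough to catch the keywords, could look like this. The `fishing_pattern` keyword list and the `toy_nodes` data are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for mc3_nodes.
toy_nodes <- tibble(
  id = c("A", "B", "C"),
  product_services = c("Fresh FISH and Seafood", "Metal products", "Unknown")
)

# Hypothetical keyword list; "fish" also matches "fishes" and "fishing".
fishing_pattern <- "fish|seafood"

toy_flagged <- toy_nodes %>%
  mutate(
    product_services = str_to_lower(product_services),      # lower case first
    n_fish = str_count(product_services, fishing_pattern),  # matches per row
    fishing_related = n_fish > 0
  )
```

A substring pattern like this also matches unrelated words that happen to contain "fish", so the keyword list may need tightening for real data.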
A plot of the distribution of nodes with fishing related descriptions in their product services is shown below.

It is observed that the majority of the nodes are not related to fishing activities.
The word tokenisation has different meanings in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters, such as punctuation marks, may be discarded. The tokens usually become the input for processes like parsing and text mining.
In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.
The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
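As a small illustration of the argument order, the sketch below tokenises a hypothetical two-row `toy_nodes` tibble; the output column name comes first and the input column second.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for mc3_nodes.
toy_nodes <- tibble(
  id = c("A", "B"),
  product_services = c("Fish and seafood products", "Metal products")
)

# word = output column to create, product_services = input column to split.
toy_tokens <- toy_nodes %>%
  unnest_tokens(word, product_services)
```

By default unnest_tokens() lower-cases the text and strips punctuation, so each row of `toy_tokens` holds one lower-case word alongside the `id` it came from.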
Next, we visualise the words extracted by using the code chunk below.

The bar chart reveals that the unique words include some that are not useful for analysis, e.g. “and” and “of”. We want to remove these words because they are fillers used to compose a sentence.
The tidytext package provides a data frame called stop_words that can be used with anti_join() to remove stop words.
stopwords_removed <- token_nodes %>%
  anti_join(stop_words, by = "word")
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # After coord_flip(), x (word) is drawn vertically and y (n) horizontally
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
To make the data cleaner and more meaningful, we recode the words “character”, “0”, “unknown” and “related” in the product_services field to NA. The words “characters” and “characterization” are also present, and these are recoded to NA as well.
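The recoding chunk is not reproduced here. One way to sketch it, assuming the recoding is applied to the tokenised `word` column, is shown below; `toy_tokens` and `placeholder_words` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the tokenised nodes table.
toy_tokens <- tibble(
  id = c("A", "A", "B", "C"),
  word = c("fish", "unknown", "character", "metal")
)

# Words treated as placeholders rather than real product descriptions.
placeholder_words <- c("character", "characters", "characterization",
                       "0", "unknown", "related")

# Replace placeholder words with NA and leave every other word untouched.
toy_recoded <- toy_tokens %>%
  mutate(word = if_else(word %in% placeholder_words, NA_character_, word))
```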
We observed there are lists in the source column of mc3_edges data table. To clean up the list, we first segregate out rows with lists from the mc3_edges data table into a separate mc3_edges_wlist data table.
As the source column is in chr format, we unlist the lists in the source column of the mc3_edges_wlist data table by removing the characters “c(” and “)” and splitting the elements on “,” using the stringr package. Then, we remove any duplicate names using lapply. Next, we use the unnest_longer function to separate the elements into new rows.
Lastly, we merge back the rows with the original mc3_edges data table (less the list) to form the cleaned edges data table.
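The three steps above can be sketched as follows, assuming list-valued sources were coerced to strings of the form `c("A", "B")`. All `toy_*` names and the company names are hypothetical, and the regular expressions are one possible choice rather than the ones used in the original chunk.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Hypothetical edges: the second source is a list flattened into one string.
toy_edges <- tibble(
  source = c("Adams Group", 'c("Alpha Ltd", "Beta Inc", "Alpha Ltd")'),
  target = c("John Smith", "Jane Doe"),
  type = "Beneficial Owner",
  weights = 1L
)

# Step 1: separate rows whose source looks like a flattened list.
toy_wlist <- toy_edges %>% filter(str_detect(source, "^c\\("))
toy_nolist <- toy_edges %>% filter(!str_detect(source, "^c\\("))

# Step 2: strip c( ... ) and quotes, split on commas, de-duplicate, unnest.
toy_wlist_clean <- toy_wlist %>%
  mutate(source = str_remove_all(source, '^c\\(|\\)$|"'),
         source = str_split(source, ",\\s*"),
         source = lapply(source, unique)) %>%
  unnest_longer(source)

# Step 3: merge the cleaned rows back with the untouched rows.
toy_edges_cleaned <- bind_rows(toy_nolist, toy_wlist_clean)
```

Splitting on commas would also break up any company name that itself contains a comma, so real data may need a stricter pattern.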
Now that we have a cleaned mc3 edges data table, we will use this to create a new nodes data table to ensure all sources and targets are captured in the nodes data table to facilitate accurate development and analysis of network graphs later.
We want to extract the source and target from the mc3_edges_cleaned data table and left join stopwords_removed nodes data table to form the new nodes table.
We first take a look at the stopwords_removed nodes data table. We observe that there are duplicate ids, some of which share the same id but have different countries. We want to combine those duplicate ids to remove the duplication. This can be done by grouping by id and type and using the summarise function to concatenate the country and word values. For revenue_omu, we take the median, since a sum would not be comparable with nodes that have a single record. Then, we use the str_split function to split the concatenated string on ” , ” and use lapply to ensure there are no duplicates within each entry of the country column.
mc3_nodesclean <- stopwords_removed %>%
  group_by(id, type) %>%
  summarise(country = paste(country, collapse = " , "),
            revenue_omu = median(revenue_omu),
            word = paste(word, collapse = " , ")) %>%
  ungroup()
# Split the concatenated countries into a list column and de-duplicate each entry
mc3_nodesclean$country <- str_split(mc3_nodesclean$country, " , ")
mc3_nodesclean$country <- lapply(mc3_nodesclean$country, unique)
mc3_nodesclean <- mc3_nodesclean %>%
  select(id, country, revenue_omu, word)
Then, we look at the mc3_edges_cleaned data table. We observe that every source in mc3_edges_cleaned is a company while every target is a person's name, which is consistent with type taking the values Beneficial Owner and Company Contacts. As such, we can safely assume that type describes the target.
We extract the source (and create a new column type and name it as “Company”) and also extract target from the mc3_edges_cleaned data table and left join mc3_nodesclean data table to form the new nodes table.
id3 <- mc3_edges_cleaned %>%
  select(source) %>%
  rename(id = source) %>%
  mutate(type = "Company")
id4 <- mc3_edges_cleaned %>%
  select(target, type) %>%
  rename(id = target)
mc3_nodes_cleaned <- rbind(id3, id4) %>%
  distinct() %>%
  left_join(mc3_nodesclean, by = c("id" = "id"),
            unmatched = "drop")
mc3_graphcleaned <- tbl_graph(nodes = mc3_nodes_cleaned,
                              edges = mc3_edges_cleaned,
                              directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
We are now ready to analyse the following relationships:
Company (source) and Beneficial Owner (target)
Company (source) and Company Contacts (target)
We want to check the frequency distribution of Company ownership by Beneficial Owner.
First, select edges where source type = Company and target type = Beneficial Owner. Count number of duplicated target i.e. Beneficial Owner to find out how many companies are owned by the Beneficial Owner. Then plot the distribution.
mc3_edges_cb <- mc3_edges_cleaned %>%
  filter(type == "Beneficial Owner") %>%
  add_count(target) # Adds a column (n) giving the number of companies linked to the target
ggplot(data = mc3_edges_cb,
       aes(x = n)) +
  geom_bar() +
  labs(title = "Distribution of Number of Companies Owned by Beneficial Owner",
       x = "Number of Companies Owned",
       y = "Number of Beneficial Owners")
| source | target | type | weights | n |
|---|---|---|---|---|
| Adams Group | John Smith | Beneficial Owner | 1 | 11 |
| Faroe Islands Company World | John Smith | Beneficial Owner | 1 | 11 |
| Guzman-Chang | John Smith | Beneficial Owner | 1 | 11 |
| Peterson PLC | John Smith | Beneficial Owner | 1 | 11 |
| Ryan-Curry | John Smith | Beneficial Owner | 1 | 11 |
| The Salted Pearl Inc Pelican | John Smith | Beneficial Owner | 1 | 11 |
| hǎi zhé Herring Incorporated Logistics | John Smith | Beneficial Owner | 1 | 11 |
| Beachcombers Nautical Plc Carriers | John Smith | Beneficial Owner | 1 | 11 |
| SeaSplash Foods Corporation Freight | John Smith | Beneficial Owner | 1 | 11 |
| Kerala Market Oyj Freight | John Smith | Beneficial Owner | 1 | 11 |
It is observed that the majority of the Beneficial Owners own only 1 company, which is normal. However, a minority of them own more than 4 companies. For example, John Smith owns the most companies (11). This is rather suspicious and should be investigated further.
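One point worth keeping in mind when reading the bar chart: add_count() keeps one row per edge, so an owner linked to n companies contributes n rows to the bar at n. A per-owner tally can be sketched with count() instead. The `toy_edges` data and the column names `n_companies` and `n_owners` are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical edges: Owner A owns three companies, Owner B owns one.
toy_edges <- tibble(
  source = c("Co 1", "Co 2", "Co 3", "Co 4"),
  target = c("Owner A", "Owner A", "Owner A", "Owner B"),
  type = "Beneficial Owner"
)

# add_count(): one row per edge, so Owner A appears three times with n = 3.
per_edge <- toy_edges %>% add_count(target)

# count(): one row per owner.
per_owner <- toy_edges %>% count(target, name = "n_companies")

# Number of owners at each ownership level.
owner_distribution <- per_owner %>% count(n_companies, name = "n_owners")
```

In this toy case the per-edge table would put three rows in the bar for 3 companies, while the per-owner distribution reports a single owner at that level.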
First, we form a new nodes data table from the source and target columns of the edges data table, so that every source and target value appears as a node. We then keep only nodes with group_components() < 10; because group_components() numbers connected components from largest to smallest, this retains the nine largest communities.
mc3_edges_cb4 <- mc3_edges_cb %>%
  filter(n > 4)
id7 <- mc3_edges_cb4 %>%
  select(source) %>%
  rename(id = source)
id8 <- mc3_edges_cb4 %>%
  select(target) %>%
  rename(id = target)
mc3_nodes_cb4 <- rbind(id7, id8) %>%
  distinct() %>%
  left_join(mc3_nodes_cleaned, by = c("id" = "id"),
            unmatched = "drop") %>%
  filter(type %in% c("Company", "Beneficial Owner"))
mc3_graph_cb4 <- tbl_graph(nodes = mc3_nodes_cb4,
                           edges = mc3_edges_cb4,
                           directed = FALSE)
mc3_graph_cb4 <- mc3_graph_cb4 %>%
  activate("nodes") %>%
  mutate(group = group_components()) %>%
  filter(group < 10)
edges_cb4_df <- mc3_graph_cb4 %>%
  activate(edges) %>%
  as_tibble()
nodes_cb4_df <- mc3_graph_cb4 %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  select(id, label, group)
From the above graph, we can see that John Smith owns 11 companies. We extract information on these companies and can only ascertain that one of them (Adams Group) is related to fishing activities. Beachcombers Nautical Plc Carriers deals in products, mail and houses, and it has an exceptionally high revenue_omu of 9891666.673. The rest have no information.
mc3_nodes_cb4 %>%
  filter(id %in% c("Adams Group", "Faroe Islands Company World", "Guzman-Chang", "Peterson PLC", "Ryan-Curry", "The Salted Pearl Inc Pelican", "hǎi zhé Herring Incorporated Logistics", "Beachcombers Nautical Plc Carriers", "SeaSplash Foods Corporation Freight", "Kerala Market Oyj Freight", "Oka S.A. de C.V."))
# A tibble: 10 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Adams Group Company <chr [1]> NA NA , NA…
2 Adams Group Company <chr [1]> 9056. range ,…
3 Adams Group Company <chr [1]> NA NA , NA…
4 Guzman-Chang Company <NULL> NA <NA>
5 Peterson PLC Company <NULL> NA <NA>
6 Ryan-Curry Company <NULL> NA <NA>
7 The Salted Pearl Inc Pelican Company <chr [1]> 8619. NA
8 hǎi zhé Herring Incorporated Logistics Company <chr [1]> 4767. NA
9 Beachcombers Nautical Plc Carriers Company <chr [1]> 9891667. product…
10 SeaSplash Foods Corporation Freight Company <NULL> NA <NA>
The Beneficial Owner with the second-most companies (9) is Michael Johnson. We extract information on his companies and can only ascertain that one of the 9 (Baker and Sons) is related to fishing activities; it has a notably high revenue_omu of 104095830. SeaBass Leska N.V. International is in the milling business. The rest have no information.
# A tibble: 11 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Baker and Sons Company <chr [1]> NA NA , NA , N…
2 Baker and Sons Company <chr [1]> 104095830. fish , fres…
3 Chen, Jones and Davis Company <NULL> NA <NA>
4 Hancock Inc Company <NULL> NA <NA>
5 Jones, Kennedy and Johnson Company <NULL> NA <NA>
6 Knight-Brown Company <NULL> NA <NA>
7 Miller, Wiggins and Smith Company <NULL> NA <NA>
8 SeaBass Leska N.V. International Company <chr [1]> 106543. vertical , …
9 Seashell Seekers ОАО International Company <chr [1]> 7926. NA
10 Thompson LLC Company <chr [1]> NA NA , NA , N…
11 Thompson LLC Company <chr [1]> NA NA , NA , N…
The Beneficial Owner with the third-most companies (8) is Jennifer Smith. We extract information on her companies and observe that, of the 8, only Mar del Oeste - and Dutch Mussels S.p.A. Sea spray are related to fishing activities. Luangwa River Limited Liability Company Holdings is in the chemicals business, while Mar de Cristal ОАО is in the dairy products business. The rest have no information.
# A tibble: 9 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Cortez Group Comp… <chr> NA NA ,…
2 Cortez Group Comp… <chr> NA NA ,…
3 Hamilton LLC Comp… <chr> NA NA ,…
4 Luangwa River Limited Liability Company Hol… Comp… <chr> NA chem…
5 Mar de Coral GmbH and Son's Comp… <chr> 9726. NA
6 Mar de Cristal ОАО Comp… <chr> 30027. flui…
7 Mar del Oeste - Comp… <chr> 57148. salm…
8 Dutch Mussels S.p.A. Sea spray Comp… <chr> 168248. gela…
9 Maacama S.p.A. Marine ecology Comp… <chr> 8996. NA
Next, we observe a rather large community in which 3 companies appear to have high betweenness centrality:
BlueTide GmbH & Co. KG - in the fabrication and metal products business
West Fish GmbH Transport - in the veneer and wood business
Mar del Oeste - - in a legitimate fishing business (salmon products)
These 3 companies are all owned by Jessica Brown, who owns a total of 5 companies. In addition, BlueTide GmbH & Co. KG is also owned by David Thomas, Mar del Oeste - by Jennifer Smith and West Fish GmbH Transport by Michael Miller. We investigate this further in the section below, where we examine betweenness centrality between Companies and Beneficial Owners.
# A tibble: 3 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 BlueTide GmbH & Co. KG Company <chr [1]> 24731. fabricated , metal , …
2 Mar del Oeste - Company <chr [1]> 57148. salmon , products
3 West Fish GmbH Transport Company <chr [1]> 13627. veneer , sheets , woo…
Betweenness centrality is a way of detecting how much influence a node has over the flow of information in a graph. For each node, it counts how many of the shortest paths between all pairs of other nodes pass through that node, so it highlights nodes that serve as a bridge from one part of a graph to another. A node with higher betweenness centrality has more control over the network.
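To make the definition concrete, here is a small sketch on a hypothetical three-node path graph in which a company links two owners. Only the middle node lies on a shortest path between two other nodes, so it is the only one with a non-zero score. The node names and the `toy_*` objects are illustrative only.

```r
library(tidygraph)
library(dplyr)
library(tibble)

# Hypothetical path graph: Owner A - Company X - Owner B.
toy_nodes <- tibble(name = c("Owner A", "Company X", "Owner B"))
toy_edges <- tibble(from = c("Owner A", "Company X"),
                    to = c("Company X", "Owner B"))

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

# Pull the node table, including the new betweenness column.
toy_scores <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()
```

Company X sits on the single shortest path between the two owners and therefore scores 1, while both owners score 0.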
First, we form new nodes data table by using source and target of the edges data table. This is to ensure that the nodes in nodes data tables include all the source and target values.
id5 <- mc3_edges_cb %>%
  select(source) %>%
  rename(id = source)
id6 <- mc3_edges_cb %>%
  select(target) %>%
  rename(id = target)
mc3_nodes_cb <- rbind(id5, id6) %>%
  distinct() %>%
  left_join(mc3_nodes_cleaned, by = c("id" = "id"),
            unmatched = "drop") %>%
  filter(type %in% c("Company", "Beneficial Owner"))
mc3_graph_cb <- tbl_graph(nodes = mc3_nodes_cb,
                          edges = mc3_edges_cb,
                          directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
We bin the betweenness centrality of the nodes for ease of visualization using VisNetwork later.
Because the data set is large, we keep only nodes with betweenness centrality > 500000 that belong to a community group < 10. group_components() numbers connected components from largest to smallest, so this retains the most prominent communities.
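The binning and threshold steps can be tried in isolation with base R. The `scores` vector below is hypothetical, and the short labels are a simplification of the ones used for the real graph.

```r
# Hypothetical betweenness scores.
scores <- c(0, 999999, 1000000, 2500000, 3399060)

bins <- cut(
  scores,
  breaks = c(0, 1e6, 2e6, 3e6, Inf),
  labels = c("1", "2", "3", "4"),
  right = FALSE,         # intervals are closed on the left: [lower, upper)
  include.lowest = TRUE  # with right = FALSE, also closes the last interval
)

# Keep only scores above the plotting threshold.
kept <- scores[scores > 500000]
```

With cut()'s default of right = TRUE, a score of exactly 1000000 would fall in bin 1 rather than bin 2, so the choice matters when labels are written as 0-999999, 1000000-1999999 and so on.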
mc3_graph_cbcb <- mc3_graph_cb %>%
  activate("nodes") %>%
  mutate(group = cut(betweenness_centrality,
                     breaks = c(0, 1000000, 2000000, 3000000, Inf),
                     labels = c("1\n(0-999999)",
                                "2\n(1000000-1999999)",
                                "3\n(2000000-2999999)",
                                "4\n(>=3000000)\n"),
                     right = FALSE, # left-closed bins, so boundaries match the labels
                     include.lowest = TRUE)) %>%
  arrange(desc(betweenness_centrality)) %>%
  filter(betweenness_centrality > 500000) %>%
  mutate(group1 = group_components()) %>%
  filter(group1 < 10)
mc3_graph_cbcb <- mc3_graph_cbcb %>%
  activate("edges") %>%
  mutate(importance = centrality_edge_betweenness())
The network graph showing the relationship between Company and Beneficial Owner is plotted below.
set.seed(1234)
ggraph(mc3_graph_cbcb,
       layout = "stress") +
  geom_edge_link(aes(colour = importance),
                 alpha = 0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(size = betweenness_centrality, colour = factor(group1))) +
  theme_graph() +
  geom_node_text(aes(label = id), size = 1, repel = TRUE) +
  # The opening <span> must be closed with </span> for element_markdown() to render it
  ggtitle("<span style='font-size: 15pt;'>Betweenness Centrality Network (Company-Beneficial Owner)</span>") +
  theme(plot.title = element_markdown())
| id | type | country | revenue_omu | word | betweenness_centrality | closeness_centrality | group | group1 |
|---|---|---|---|---|---|---|---|---|
| Senegal Coast Ltd. Liability Co | Company | Oceanus | NA | NA | 3399060 | 2.09e-05 | 4 (>=3000000) | 1 |
| Jessica Brown | Beneficial Owner | NULL | NA | NA | 2793321 | 1.95e-05 | 3 (2000000-2999999) | 1 |
| Ocean Observers Marine mist | Company | Puerto Sol | 39678.54 | transportation , NA , services | 2691699 | 1.86e-05 | 3 (2000000-2999999) | 1 |
| BlueTide GmbH & Co. KG | Company | Azurionix | 24730.50 | fabricated , metal , products | 2542636 | 1.86e-05 | 3 (2000000-2999999) | 1 |
| David Thomas | Beneficial Owner | NULL | NA | NA | 2500803 | 1.77e-05 | 3 (2000000-2999999) | 1 |
The following configurations should be noted before we interpret the graphs:
Node size is set to betweenness centrality
Node colour is set to community group
Edge width is set to number of companies
Edge colour is set to centrality_edge_betweenness
It is observed that the top 5 ids with highest betweenness centrality are:
Senegal Coast Ltd. Liability Co
Jessica Brown
Ocean Observers Marine mist
BlueTide GmbH & Co. KG
David Thomas
This means that the above companies/beneficial owners have higher control over the network. However, from the data table, there is no info on the type of services that Senegal Coast Ltd. Liability Co provides while the other 2 companies are not related to fishing activities. Ocean Observers Marine mist is into transportation while BlueTide GmbH & Co. KG is into fabrication and metal products. Following down the list, we observed that only Congo Rapids Ltd. Corporation is related to fishing activities.
It is also observed that although John Smith owns the most companies, his betweenness centrality, i.e. his control over the network, is lower than that of other owners who own fewer companies.
The graph did not reflect all the links e.g. although John Smith owns the most companies, this graph only shows 2. This could be due to companies with betweenness <= 500000 being removed before plotting the graph.
In the next step, we plot the interactive graph using VisNetwork to better visualize who are the beneficial owners of the non-fishing companies.
The different colour groups represent the different ranges of betweenness centrality, with blue being the highest range, which Senegal Coast Ltd. Liability Co falls into.
As earlier mentioned, the graph did not reflect all the links. This could be due to companies with betweenness <= 500000 being removed before plotting the graph.
Senegal Coast Ltd. Liability Co has the highest betweenness centrality. Extracting the company from the edges table shows that it is owned by 22 people, although the graph shows only 4 links. The other 18 people presumably have betweenness centrality <= 500000 and are therefore not shown. We know only that the company is from Oceanus; there is no information on its revenue_omu or product services, so no firm findings can be drawn. Even so, this company is highly suspicious, as it does not make sense for it to have so many owners.
# A tibble: 22 × 5
source target type weights n
<chr> <chr> <chr> <int> <int>
1 Senegal Coast Ltd. Liability Co Amanda West Beneficia… 1 1
2 Senegal Coast Ltd. Liability Co Angela Ward Beneficia… 1 2
3 Senegal Coast Ltd. Liability Co Ashlee Campbell Beneficia… 1 1
4 Senegal Coast Ltd. Liability Co Brooke Lawson Beneficia… 1 1
5 Senegal Coast Ltd. Liability Co Carlos Harrell Beneficia… 1 1
6 Senegal Coast Ltd. Liability Co Daniel Davis Beneficia… 1 3
7 Senegal Coast Ltd. Liability Co Emily Marshall Beneficia… 1 1
8 Senegal Coast Ltd. Liability Co Erica Sanchez Beneficia… 1 1
9 Senegal Coast Ltd. Liability Co Erin Alvarez Beneficia… 1 1
10 Senegal Coast Ltd. Liability Co Francisco Singleton Beneficia… 1 1
# ℹ 12 more rows
# A tibble: 1 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Senegal Coast Ltd. Liability Co Company <chr [1]> NA NA
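The lookup that produces tables like the ones above is not reproduced in this section. A minimal sketch of it, using a hypothetical `toy_edges` table and a hypothetical company "Co X" in place of the real data, might be:

```r
library(dplyr)
library(tibble)

# Hypothetical edges: Co X has three owners, Co Y has one.
toy_edges <- tibble(
  source = c("Co X", "Co X", "Co X", "Co Y"),
  target = c("Owner A", "Owner B", "Owner C", "Owner A"),
  type = "Beneficial Owner"
)

# All ownership edges for one company.
company_owners <- toy_edges %>%
  filter(source == "Co X", type == "Beneficial Owner")

# Number of distinct owners of that company.
n_owners <- n_distinct(company_owners$target)
```

Comparing `n_owners` with the number of links drawn in the filtered graph shows how many owners were dropped by the betweenness threshold.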
Jessica Brown is the 2nd highest in betweenness centrality and she owns 5 companies as we can see from the table below.
# A tibble: 5 × 5
source target type weights n
<chr> <chr> <chr> <int> <int>
1 Bauer-Taylor Jessica Brown Beneficial Owner 1 5
2 BlueTide GmbH & Co. KG Jessica Brown Beneficial Owner 1 5
3 Mar del Oeste - Jessica Brown Beneficial Owner 1 5
4 Mcintyre-White Jessica Brown Beneficial Owner 1 5
5 West Fish GmbH Transport Jessica Brown Beneficial Owner 1 5
We extract the company information and observe that only Mar del Oeste - is related to fishing activities. The other 4 companies either have no information or are in other businesses (metal products, or wood and veneer). Mar del Oeste - has a much higher revenue_omu than the other 4 companies.
# A tibble: 5 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Bauer-Taylor Company <NULL> NA <NA>
2 BlueTide GmbH & Co. KG Company <chr [1]> 24731. fabricated , metal , …
3 Mar del Oeste - Company <chr [1]> 57148. salmon , products
4 Mcintyre-White Company <NULL> NA <NA>
5 West Fish GmbH Transport Company <chr [1]> 13627. veneer , sheets , woo…
The 3rd highest in betweenness centrality is Ocean Observers Marine mist, which is owned by 24 people. It is in the transportation business with a revenue_omu of 39678.54. This company is also highly suspicious, as it does not make sense for it to have so many owners, and its revenue_omu is not high either.
# A tibble: 24 × 5
source target type weights n
<chr> <chr> <chr> <int> <int>
1 Ocean Observers Marine mist Amanda Smith Beneficial Own… 1 4
2 Ocean Observers Marine mist Brendan Brown Beneficial Own… 1 1
3 Ocean Observers Marine mist Cindy White Beneficial Own… 1 1
4 Ocean Observers Marine mist Don Mooney Beneficial Own… 1 1
5 Ocean Observers Marine mist Douglas Park Beneficial Own… 1 1
6 Ocean Observers Marine mist Elizabeth Nicholson Beneficial Own… 1 1
7 Ocean Observers Marine mist Ethan Thomas Beneficial Own… 1 1
8 Ocean Observers Marine mist Gary Rodriguez Beneficial Own… 1 1
9 Ocean Observers Marine mist Jacob Gonzalez Beneficial Own… 1 1
10 Ocean Observers Marine mist Jamie Byrd Beneficial Own… 1 1
# ℹ 14 more rows
# A tibble: 1 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Ocean Observers Marine mist Company <chr [1]> 39679. transportation , NA…
The 4th highest in betweenness centrality is BlueTide GmbH & Co. KG which is owned by 48 people. It is in fabrication and metal products business with revenue_omu of 24730.5. This company is highly suspicious too as it does not make sense for it to have so many owners and the revenue_omu is low.
# A tibble: 48 × 5
source target type weights n
<chr> <chr> <chr> <int> <int>
1 BlueTide GmbH & Co. KG Adam Hall Beneficial Owner 1 1
2 BlueTide GmbH & Co. KG Angie Braun Beneficial Owner 1 1
3 BlueTide GmbH & Co. KG Ann Forbes Beneficial Owner 1 1
4 BlueTide GmbH & Co. KG Ashley Johns Beneficial Owner 1 1
5 BlueTide GmbH & Co. KG Austin Lowe Beneficial Owner 1 1
6 BlueTide GmbH & Co. KG Barbara Brady Beneficial Owner 1 1
7 BlueTide GmbH & Co. KG Brian Holland Beneficial Owner 1 1
8 BlueTide GmbH & Co. KG Brianna Lee Beneficial Owner 1 1
9 BlueTide GmbH & Co. KG Charles Turner Beneficial Owner 1 1
10 BlueTide GmbH & Co. KG Christine Cummings Beneficial Owner 1 1
# ℹ 38 more rows
# A tibble: 1 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 BlueTide GmbH & Co. KG Company <chr [1]> 24731. fabricated , metal , pro…
The 5th highest in betweenness centrality is David Thomas and he owns 6 companies as we can see from the table below.
# A tibble: 6 × 5
source target type weights n
<chr> <chr> <chr> <int> <int>
1 BlueTide GmbH & Co. KG David Thom… Bene… 1 6
2 Nagaland Sea Catch Ltd. Liability Co Logistics David Thom… Bene… 1 6
3 Ocean Quest S.A. de C.V. David Thom… Bene… 1 6
4 Rubio-Evans David Thom… Bene… 1 6
5 Marine Muse Pic Marine ecology David Thom… Bene… 1 6
6 Andhra Pradesh Limited Liability Company Ray David Thom… Bene… 1 6
We extract the company info and observed that only Nagaland Sea Catch Ltd. Liability Co Logistics is related to fishing activities. The other 5 companies have either no info or are doing other businesses.
# A tibble: 6 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Bauer-Taylor Comp… <NULL> NA <NA>
2 BlueTide GmbH & Co. KG Comp… <chr> 24731. fabr…
3 Nagaland Sea Catch Ltd. Liability Co Logistics Comp… <chr> 54913. proc…
4 Ocean Quest S.A. de C.V. Comp… <chr> 46704. cook…
5 Rubio-Evans Comp… <NULL> NA <NA>
6 Marine Muse Pic Marine ecology Comp… <chr> 5357. arra…
We want to check the frequency distribution of Company vs Company Contacts.
First, select edges where source type = Company and target type = Company Contacts. Count number of duplicated target i.e. Company Contacts to find out how many companies are linked to each contact. Then plot the distribution.
mc3_edges_cc <- mc3_edges_cleaned %>%
  filter(type == "Company Contacts") %>%
  add_count(target) # Adds a column (n) giving the number of companies linked to the Company Contact
ggplot(data = mc3_edges_cc,
       aes(x = n)) +
  geom_bar() +
  labs(title = "Distribution of Number of Companies linked to Company Contacts",
       x = "Number of Companies Linked",
       y = "Number of Company Contacts")
| source | target | type | weights | n |
|---|---|---|---|---|
| Náutica del Sol Ges.m.b.H. | Angela Wood | Company Contacts | 1 | 7 |
| Náutica del Mar S.A. de C.V. Carriers | Angela Wood | Company Contacts | 1 | 7 |
| Costa del Sol Carriers | Angela Wood | Company Contacts | 1 | 7 |
| Ancla de Oro United Yacht | Angela Wood | Company Contacts | 1 | 7 |
| Playa de Oro Company | Angela Wood | Company Contacts | 1 | 7 |
| Sparrmans Swordfish Ges.m.b.H. Merchants | Angela Wood | Company Contacts | 1 | 7 |
| Gulf of Guinea Oceanography | Angela Wood | Company Contacts | 1 | 7 |
| PacificPlates S.A. de C.V. | Mr. Jason Carrillo | Company Contacts | 1 | 6 |
| Baltic Sprat Ges.m.b.H. Enterprises | Mr. Jason Carrillo | Company Contacts | 1 | 6 |
| Rufiji Delta Limited Liability Company | Mr. Jason Carrillo | Company Contacts | 1 | 6 |
It is observed that the majority of the Company Contacts are associated with only 1 company, which is normal. However, a minority of them are associated with 4 or more companies. For example, Angela Wood is associated with 7 companies. This is rather suspicious and should be investigated further.
We also want to find out which company has the most contacts.
mc3_edges_cc1 <- mc3_edges_cleaned %>%
  filter(type == "Company Contacts") %>%
  add_count(source) # Adds a column (n) giving the number of contacts linked to the company
ggplot(data = mc3_edges_cc1,
       aes(x = n)) +
  geom_bar() +
  labs(title = "Distribution of Number of Company Contacts",
       x = "Number of Company Contacts",
       y = "Number of Companies")
| source | target | type | weights | n |
|---|---|---|---|---|
| Aqua Aura SE Marine life | Anthony Hunter | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Heather Erickson | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Jesus Mcclure | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Leon Pittman | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Mrs. Amy Graves | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Sarah Barrett | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Shannon Snyder | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Thomas Snyder | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Jennifer Bauer | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Jillian White | Company Contacts | 1 | 11 |
| Aqua Aura SE Marine life | Leah Cruz | Company Contacts | 1 | 11 |
| Irish Mackerel S.A. de C.V. Marine biology | Angelica Gates | Company Contacts | 1 | 7 |
| Irish Mackerel S.A. de C.V. Marine biology | Mario Lee | Company Contacts | 1 | 7 |
| Irish Mackerel S.A. de C.V. Marine biology | Tara Cooper | Company Contacts | 1 | 7 |
| Irish Mackerel S.A. de C.V. Marine biology | Wendy Gardner | Company Contacts | 1 | 7 |
It is observed that the majority of the companies are associated with only 1 Company Contact. However, a minority of them are associated with 4 or more contacts. For example, Aqua Aura SE Marine life has 11 contacts. This is rather suspicious and should be investigated further.
First, we form a new nodes data table from the source and target columns of the edges data table, so that every source and target value appears as a node. We then keep only nodes with group_components() < 10; because group_components() numbers connected components from largest to smallest, this retains the nine largest communities.
mc3_edges_cc4 <- mc3_edges_cc %>%
filter(n>3)
id9 <- mc3_edges_cc4 %>%
select(source) %>%
rename(id = source)
id10 <- mc3_edges_cc4 %>%
select(target) %>%
rename(id = target)
mc3_nodes_cc4 <- rbind(id9, id10) %>%
distinct() %>%
left_join(mc3_nodes_cleaned, by=c("id" = "id"),
unmatched = "drop") %>%
filter(type %in% c("Company", "Company Contacts"))
mc3_graph_cc4 <- tbl_graph(nodes = mc3_nodes_cc4,
edges = mc3_edges_cc4,
directed = FALSE)
mc3_graph_cc4 <- mc3_graph_cc4 %>%
activate("nodes") %>%
mutate(group = group_components()) %>%
filter(group < 10)
edges_cc4_df <- mc3_graph_cc4 %>%
activate(edges) %>%
as_tibble()
nodes_cc4_df <- mc3_graph_cc4 %>%
activate(nodes) %>%
as_tibble() %>%
rename(label = id) %>%
mutate(id=row_number()) %>%
select(id, label, group)
From the above graph, we can see that Angela Wood has contacts with 7 companies. We extract the information on the companies associated with Angela Wood and find no revenue_omu or product services data, other than that Náutica del Sol Ges.m.b.H. is involved in the industrial adhesives business.
# A tibble: 7 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Náutica del Sol Ges.m.b.H. Company <chr [2]> NA indus…
2 Náutica del Mar S.A. de C.V. Carriers Company <NULL> NA <NA>
3 Costa del Sol Carriers Company <NULL> NA <NA>
4 Ancla de Oro United Yacht Company <NULL> NA <NA>
5 Playa de Oro Company Company <NULL> NA <NA>
6 Sparrmans Swordfish Ges.m.b.H. Merchants Company <NULL> NA <NA>
7 Gulf of Guinea Oceanography Company <NULL> NA <NA>
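The lookup behind a table like the one above is not shown, so the following is only a minimal sketch of how it might be done. The toy data and the helper `companies_of()` are hypothetical; the sketch assumes, as in the edge table earlier, that companies sit in `source` and contacts in `target`.

```r
library(dplyr)

# Toy stand-ins for mc3_edges_cc and mc3_nodes_cleaned (all values invented)
toy_edges <- tibble(
  source = c("Company A", "Company B", "Company C"),
  target = c("Contact X", "Contact X", "Contact Y"),
  type   = "Company Contacts"
)
toy_nodes <- tibble(
  id          = c("Company A", "Company B", "Company C"),
  type        = "Company",
  revenue_omu = c(NA, 1200, 50)
)

# Hypothetical helper: list the companies linked to one person,
# together with whatever node attributes are available for them
companies_of <- function(edges, nodes, person) {
  edges %>%
    filter(target == person) %>%      # edges pointing at this person
    distinct(source) %>%              # one row per company
    left_join(nodes, by = c("source" = "id")) %>%
    rename(id = source)
}

contact_x_companies <- companies_of(toy_edges, toy_nodes, "Contact X")
contact_x_companies
```

A left join is used so that companies without node attributes still appear, with NA in the missing fields.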
The Company Contact with the 2nd highest number of contacts (6) is Jason Carillo. We extract the information on the companies associated with him and cannot draw any insights, as there is no info on revenue_omu or product services.
# A tibble: 6 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 PacificPlates S.A. de C.V. Company <chr [1]> NA NA
2 Baltic Sprat Ges.m.b.H. Enterprises Company <NULL> NA <NA>
3 Rufiji Delta Limited Liability Company Company <chr [1]> NA NA
4 jīn qiāng yú AG Company <NULL> NA <NA>
5 Aqua Adventures Ltd. Corporation Company <NULL> NA <NA>
6 Náutica del Mar GmbH & Co. KG Company <NULL> NA <NA>
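Jennifer Johnson is one of the Company Contacts with the 3rd highest number of contacts (5). We extract the information on the companies associated with her and could only find that House Inc is in the glass and packaging business with a relatively high revenue_omu of 157513.702, while Tshikwea S.A. de C.V. is in the stationery business with a revenue_omu of 25221.835. There is no product services info on the other 3 companies, although Mar del Golfo Incorporated reports a revenue_omu of about 8657. Note that Rodriguez and Sons appears twice in the output below, which is why the tibble has 6 rows.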
# A tibble: 6 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 House Inc Company <chr [1]> 157514. glass , packaging , …
2 Mar del Golfo Incorporated Company <chr [1]> 8657. NA
3 Rodriguez and Sons Company <chr [1]> NA NA , NA , NA , NA , …
4 Rodriguez and Sons Company <chr [1]> NA NA , NA
5 Silva-Cabrera Company <NULL> NA <NA>
6 Tshikwea S.A. de C.V. Company <chr [1]> 25222. offers , range , sta…
First, we form a new nodes data table from the source and target columns of the edges data table. This ensures that the nodes data table includes all the source and target values.
id11 <- mc3_edges_cc %>%
select(source) %>%
rename(id = source)
id12 <- mc3_edges_cc %>%
select(target) %>%
rename(id = target)
mc3_nodes_cc <- rbind(id11, id12) %>%
distinct() %>%
left_join(mc3_nodes_cleaned, by=c("id" = "id"),
unmatched = "drop") %>%
filter(type %in% c("Company", "Company Contacts"))
mc3_graph_cc <- tbl_graph(nodes = mc3_nodes_cc,
edges = mc3_edges_cc,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness())
Due to the large data size, we keep only the nodes with betweenness centrality > 0 and community group < 10. group_components() is used to identify the prominent communities.
mc3_graph_cccb <- mc3_graph_cc %>%
activate("nodes") %>%
arrange(desc(betweenness_centrality)) %>%
filter(betweenness_centrality > 0) %>%
mutate(group1 = group_components()) %>%
filter(group1 < 10)
mc3_graph_cccb <- mc3_graph_cccb %>%
activate("edges") %>%
mutate(importance = centrality_edge_betweenness())
The network graph showing the relationship between Company and Company Contacts is plotted.
set.seed(1234)
ggraph(mc3_graph_cccb,
layout = "fr") +
geom_edge_link(aes(colour = importance),
alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(size = betweenness_centrality, colour = factor(group1))) +
theme_graph() +
geom_node_text(aes(label = id), size = 1, repel=TRUE) +
ggtitle("<span style='font-size: 14pt;'>Betweenness Centrality Network (Company-Company Contacts)</span>") +
theme(plot.title = element_markdown())
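The `group_components()` filter applied before plotting can be illustrated on a toy graph. This is a sketch with invented node names, not part of the original analysis. It relies on tidygraph numbering components by decreasing size, so a `group < k` filter keeps the k - 1 largest components.

```r
library(dplyr)
library(tidygraph)

# Toy edge list with three components of sizes 2, 3 and 4.
# The smallest component is listed first on purpose.
toy_edges <- tibble(
  from = c("h", "e", "f", "a", "b", "c"),
  to   = c("i", "f", "g", "b", "c", "d")
)

kept <- as_tbl_graph(toy_edges, directed = FALSE) %>%
  mutate(group = group_components()) %>%  # 1 = largest component
  filter(group < 3) %>%                   # keep the two largest components
  as_tibble()

kept
```

Here the pair h and i is dropped because it forms the smallest component, regardless of where it appears in the edge list.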
| id | type | country | revenue_omu | word | betweenness_centrality | closeness_centrality | group1 |
|---|---|---|---|---|---|---|---|
| Aqua Aura SE Marine life | Company | Mawazam , Rio Isla , Icarnia , Oceanus , Nalakond , Coralmarica, Alverossia , Isliandor , Talandria | NA | food , ingredients , frozen , fruits , vegetables , ready , eat , products , ready , cook , products , snacks , drinks , kitchen , accessories , seafood , aquatic , products , bone , ash , natural , calcium , phosphate , feldspar , prepared , bone , china , bodies , NA , NA , fish , seafood , products , tuna , salmon , herring , shellfish , groundfish , products , flounder , fillets , cornmeal , pollock , strips , burger , tuna , steak , frozen , halibut , steaks , canned , sockeye , salmon , frozen , sockeye , crabs , alaskan , seafood , domestic , caviar , imported , caviar , NA , NA , NA , optical , fiber , communication , passive , components , fiber , optical , attenuator , fiber , optical , isolator , fiber , optical , switch , optical , filter , fiber , optical , fiber , optical , coupler , NA | 108.5 | 0.0357143 | 4 |
| Irish Mackerel S.A. de C.V. Marine biology | Company | Oceanus | NA | transportation , services , ceramic , resin , home , garden , decor , source , freelance , researcher , involved , retailing , fresh , frozen , cured , meats , poultry , NA , NA , NA , NA , NA , NA , NA , NA , NA | 70.5 | 0.0277778 | 4 |
| Jillian White | Company Contacts | NULL | NA | NA | 30.0 | 0.0312500 | 4 |
| Leah Cruz | Company Contacts | NULL | NA | NA | 30.0 | 0.0312500 | 4 |
| Mar del Norte NV | Company | Oceanus, Marebak | 22966.67 | seafoods , fish , NA , seafood , products , NA | 21.0 | 0.0769231 | 5 |
The following configurations should be noted before we interpret the graphs:
Node size is set to betweenness centrality
Node colour is set to community group
Edge width is set to number of companies linked to the Company Contact
Edge colour is set to centrality_edge_betweenness
It is observed that the 5 ids with the highest betweenness centrality are:
Aqua Aura SE Marine life
Irish Mackerel S.A. de C.V. Marine biology
Jillian White
Leah Cruz
Mar del Norte NV
This means that the above companies and Company Contacts have greater control over the network, since more of the shortest paths between other nodes pass through them. Aqua Aura SE Marine life has the highest betweenness centrality, which is consistent with the earlier finding that it has the most contacts (11). However, there is no info on its revenue_omu to aid the analysis. Both Aqua Aura SE Marine life and Mar del Norte NV are in the fishing business, while Irish Mackerel S.A. de C.V. Marine biology may possibly be in the fishing business too, as its product services description mentions fresh, frozen and cured meats.
# A tibble: 3 × 5
id type country revenue_omu word
<chr> <chr> <list> <dbl> <chr>
1 Aqua Aura SE Marine life Company <chr [9]> NA food…
2 Irish Mackerel S.A. de C.V. Marine biology Company <chr [1]> NA tran…
3 Mar del Norte NV Company <chr [2]> 22967. seaf…
It is also observed that although Angela Wood is associated with the largest number of companies, her betweenness centrality, i.e. her control over the network, is not as high as that of other contacts who are associated with fewer companies.
In conclusion, the majority of Beneficial Owners own only 1 company, which is normal. However, a minority of them own more than 4 companies, which is suspicious. For example, John Smith owns the largest number of companies (11). We can only ascertain that one of these companies (Adams Group) is related to fishing activities. Beachcombers Nautical Plc Carriers is described with the terms products, mail and houses, and it has an exceptionally high revenue_omu of 9891666.673. There is no info for the rest of the companies.
It is also observed that the majority of Company Contacts are associated with only 1 company, which is normal. However, a minority of them are associated with 4 or more companies, which is rather suspicious. For example, Angela Wood has contacts with the largest number of companies (7). However, there is no info on revenue_omu or product services for any of the 7 companies, other than that Náutica del Sol Ges.m.b.H. is involved in the industrial adhesives business. It is also suspicious that Aqua Aura SE Marine life has so many Company Contacts (11).
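Why a contact with many companies can still score lower on betweenness may be easier to see on a toy network. The sketch below uses invented names and is not drawn from the MC3 data. Contact X is linked to 4 companies that connect to nothing else, while contact B is linked to only 2 companies but bridges two otherwise separate clusters.

```r
library(dplyr)
library(tidygraph)

# Toy edge list (all names invented for illustration)
toy_edges <- tibble(
  from = c("X", "X", "X", "X",        # contact X linked to 4 companies
           "C1", "C1", "C1", "C1",    # company C1: 3 contacts plus bridge B
           "C2", "C2", "C2", "C2"),   # company C2: 3 contacts plus bridge B
  to   = c("K1", "K2", "K3", "K4",
           "P1", "P2", "P3", "B",
           "Q1", "Q2", "Q3", "B")
)

toy_scores <- as_tbl_graph(toy_edges, directed = FALSE) %>%
  mutate(degree      = centrality_degree(),
         betweenness = centrality_betweenness()) %>%
  as_tibble() %>%
  arrange(desc(betweenness))

toy_scores
```

In this toy network X has degree 4 but betweenness 6, since it only lies between its own 4 companies, whereas B has degree 2 and betweenness 16, because every path between the two clusters passes through it.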